OCR exploration server

Warning: this site is under development.
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

A complete printed Bangla OCR system

Internal identifier: 002162 (Main/Exploration); previous: 002161; next: 002163

Authors: Bidyut Baran Chaudhuri [India]; U. Pal [India]

RBID : ISTEX:B3F3AD24F335762C9980E00780C64DE3A4CEE2F3

Abstract

A complete Optical Character Recognition (OCR) system for printed Bangla, the fourth most popular script in the world, is presented. This is the first OCR system among all script forms used in the Indian sub-continent. The problem is difficult because (i) there are about 300 basic, modified and compound character shapes in the script, (ii) the characters in a word are topologically connected and (iii) Bangla is an inflectional language. In our system the document image captured by a flatbed scanner is subjected to skew correction, text-graphics separation, line segmentation, zone detection, and word and character segmentation, using some conventional and some newly developed techniques. From zonal information and shape characteristics, the basic, modified and compound characters are separated for the convenience of classification. The basic and modified characters, which are about 75 in number and which account for about 96% of the text corpus, are recognized by a structural-feature-based tree classifier. The compound characters are recognized by a tree classifier followed by a template-matching approach. The feature detection is simple and robust, and preprocessing steps such as thinning and pruning are avoided. Character unigram statistics are used to make the tree classifier efficient. Several heuristics are also used to speed up the template-matching approach. A dictionary-based error-correction scheme has been used, in which separate dictionaries are compiled for root words and suffixes and also contain morpho-syntactic information. For single-font clear documents, 95.50% word-level recognition accuracy (which is equivalent to 99.10% character-level accuracy) has been obtained. Extension of the work to Devnagari, the third most popular script in the world, is also discussed.
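
The relation between the two accuracy figures quoted in the abstract can be checked with a short calculation. The sketch below assumes, purely for illustration, that character errors are independent and that every word has the same length n, so that word accuracy equals character accuracy raised to the power n. Neither assumption comes from the paper, and the function name `implied_word_length` is hypothetical.

```python
import math


def implied_word_length(char_accuracy: float, word_accuracy: float) -> float:
    """Return n such that char_accuracy ** n == word_accuracy.

    Assumes independent character errors and a fixed word length;
    both are simplifications made for this illustration only.
    """
    return math.log(word_accuracy) / math.log(char_accuracy)


# Accuracy figures quoted in the abstract.
n = implied_word_length(0.9910, 0.9550)
print(f"implied average word length: {n:.2f} characters")
```

Under these assumptions the two figures are consistent with an average word length of roughly five characters.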

DOI: 10.1016/S0031-3203(97)00078-2


The document in XML format

<record>
<TEI wicri:istexFullTextTei="biblStruct">
<teiHeader>
<fileDesc>
<titleStmt>
<title>A complete printed Bangla OCR system</title>
<author>
<name sortKey="Chaudhuri, B B" sort="Chaudhuri, B B" uniqKey="Chaudhuri B" first="B. B" last="Chaudhuri">Bidyut Baran Chaudhuri</name>
<affiliation>
<country>Inde</country>
<placeName>
<settlement type="city">Calcutta</settlement>
<region type="province">Bengale-Occidental</region>
</placeName>
<orgName type="lab" n="5">Institut indien de statistiques</orgName>
</affiliation>
</author>
<author>
<name sortKey="Pal, U" sort="Pal, U" uniqKey="Pal U" first="U" last="Pal">U. Pal</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:B3F3AD24F335762C9980E00780C64DE3A4CEE2F3</idno>
<date when="1998" year="1998">1998</date>
<idno type="doi">10.1016/S0031-3203(97)00078-2</idno>
<idno type="url">https://api.istex.fr/document/B3F3AD24F335762C9980E00780C64DE3A4CEE2F3/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">000029</idno>
<idno type="wicri:Area/Istex/Curation">000029</idno>
<idno type="wicri:Area/Istex/Checkpoint">001666</idno>
<idno type="wicri:doubleKey">0031-3203:1998:Chaudhuri B:a:complete:printed</idno>
<idno type="wicri:Area/Main/Merge">002279</idno>
<idno type="wicri:source">INIST</idno>
<idno type="RBID">Pascal:98-0263610</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000890</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000B07</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000860</idno>
<idno type="wicri:doubleKey">0031-3203:1998:Chaudhuri B:a:complete:printed</idno>
<idno type="wicri:Area/Main/Merge">002454</idno>
<idno type="wicri:Area/Main/Curation">002162</idno>
<idno type="wicri:Area/Main/Exploration">002162</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title level="a">A complete printed Bangla OCR system</title>
<author>
<name sortKey="Chaudhuri, B B" sort="Chaudhuri, B B" uniqKey="Chaudhuri B" first="B. B" last="Chaudhuri">Bidyut Baran Chaudhuri</name>
<affiliation wicri:level="1">
<country xml:lang="fr">Inde</country>
<wicri:regionArea>Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, 203 B. T. Road, Calcutta 700 035</wicri:regionArea>
<wicri:noRegion>Calcutta 700 035</wicri:noRegion>
<placeName>
<settlement type="city">Calcutta</settlement>
<region type="province">Bengale-Occidental</region>
</placeName>
<orgName type="lab" n="5">Institut indien de statistiques</orgName>
</affiliation>
</author>
<author>
<name sortKey="Pal, U" sort="Pal, U" uniqKey="Pal U" first="U" last="Pal">U. Pal</name>
<affiliation wicri:level="1">
<country xml:lang="fr">Inde</country>
<wicri:regionArea>Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, 203 B. T. Road, Calcutta 700 035</wicri:regionArea>
<wicri:noRegion>Calcutta 700 035</wicri:noRegion>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series>
<title level="j">Pattern Recognition</title>
<title level="j" type="abbrev">PR</title>
<idno type="ISSN">0031-3203</idno>
<imprint>
<publisher>ELSEVIER</publisher>
<date type="published" when="1997">1997</date>
<biblScope unit="volume">31</biblScope>
<biblScope unit="issue">5</biblScope>
<biblScope unit="page" from="531">531</biblScope>
<biblScope unit="page" to="549">549</biblScope>
</imprint>
<idno type="ISSN">0031-3203</idno>
</series>
<idno type="istex">B3F3AD24F335762C9980E00780C64DE3A4CEE2F3</idno>
<idno type="DOI">10.1016/S0031-3203(97)00078-2</idno>
<idno type="PII">S0031-3203(97)00078-2</idno>
</biblStruct>
</sourceDesc>
<seriesStmt>
<idno type="ISSN">0031-3203</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Corrector</term>
<term>Decision rule</term>
<term>Document processing</term>
<term>Error detection</term>
<term>Expert system</term>
<term>Language processing</term>
<term>Natural language</term>
<term>Optical character recognition</term>
<term>Pattern recognition</term>
<term>Segmentation</term>
<term>System performance</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr">
<term>Correcteur</term>
<term>Détection erreur</term>
<term>Langage naturel</term>
<term>Performance système</term>
<term>Reconnaissance forme</term>
<term>Reconnaissance optique caractère</term>
<term>Règle décision</term>
<term>Segmentation</term>
<term>Système expert</term>
<term>Traitement document</term>
<term>Traitement langage</term>
</keywords>
</textClass>
<langUsage>
<language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">A complete Optical Character Recognition (OCR) system for printed Bangla, the fourth most popular script in the world, is presented. This is the first OCR system among all script forms used in the Indian sub-continent. The problem is difficult because (i) there are about 300 basic, modified and compound character shapes in the script, (ii) the characters in a word are topologically connected and (iii) Bangla is an inflectional language. In our system the document image captured by a flatbed scanner is subjected to skew correction, text-graphics separation, line segmentation, zone detection, and word and character segmentation, using some conventional and some newly developed techniques. From zonal information and shape characteristics, the basic, modified and compound characters are separated for the convenience of classification. The basic and modified characters, which are about 75 in number and which account for about 96% of the text corpus, are recognized by a structural-feature-based tree classifier. The compound characters are recognized by a tree classifier followed by a template-matching approach. The feature detection is simple and robust, and preprocessing steps such as thinning and pruning are avoided. Character unigram statistics are used to make the tree classifier efficient. Several heuristics are also used to speed up the template-matching approach. A dictionary-based error-correction scheme has been used, in which separate dictionaries are compiled for root words and suffixes and also contain morpho-syntactic information. For single-font clear documents, 95.50% word-level recognition accuracy (which is equivalent to 99.10% character-level accuracy) has been obtained. Extension of the work to Devnagari, the third most popular script in the world, is also discussed.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>Inde</li>
</country>
<region>
<li>Bengale-Occidental</li>
</region>
<settlement>
<li>Calcutta</li>
</settlement>
<orgName>
<li>Institut indien de statistiques</li>
</orgName>
</list>
<tree>
<country name="Inde">
<region name="Bengale-Occidental">
<name sortKey="Chaudhuri, B B" sort="Chaudhuri, B B" uniqKey="Chaudhuri B" first="B. B" last="Chaudhuri">Bidyut Baran Chaudhuri</name>
</region>
<name sortKey="Pal, U" sort="Pal, U" uniqKey="Pal U" first="U" last="Pal">U. Pal</name>
</country>
</tree>
</affiliations>
</record>
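
Once extracted, a record like this one can be read with a standard XML parser. The sketch below is a minimal illustration in Python using `xml.etree.ElementTree`. It parses only an abridged fragment copied from the `<series>` element above, because the full record as shown uses the `wicri:` prefix with no visible namespace declaration, which a strict parser would likely reject. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Abridged fragment copied from the <series> element of the record.
SERIES_XML = """
<series>
<title level="j">Pattern Recognition</title>
<title level="j" type="abbrev">PR</title>
<idno type="ISSN">0031-3203</idno>
<imprint>
<biblScope unit="volume">31</biblScope>
<biblScope unit="issue">5</biblScope>
<biblScope unit="page" from="531">531</biblScope>
<biblScope unit="page" to="549">549</biblScope>
</imprint>
</series>
"""

series = ET.fromstring(SERIES_XML.strip())

# find() returns the first match, here the full journal title.
journal = series.find("title").text
issn = series.find("idno[@type='ISSN']").text
volume = series.find("imprint/biblScope[@unit='volume']").text
issue = series.find("imprint/biblScope[@unit='issue']").text
# The page range is split over two elements, told apart by attribute.
first_page = series.find("imprint/biblScope[@from]").get("from")
last_page = series.find("imprint/biblScope[@to]").get("to")

citation = f"{journal} {volume}({issue}): {first_page}-{last_page}"
print(citation)
print("ISSN", issn)
```

Selecting the page elements by the presence of the `from` and `to` attributes avoids relying on their order inside `<imprint>`.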

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 002162 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 002162 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     ISTEX:B3F3AD24F335762C9980E00780C64DE3A4CEE2F3
   |texte=   A complete printed Bangla OCR system
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024